-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[opentitantool] Run "nice"r sims #16913
Conversation
Fix some timing issues by running sims with nice -5 to ensure test harnesses don't get starved of threads. This should solve some batch testing problems with a large number of threads serving verilator simulations. Signed-off-by: Drew Macrae <[email protected]>
@@ -35,9 +35,10 @@ pub struct Subprocess { | |||
impl Subprocess { | |||
/// Starts a verilator [`Subprocess`] based on [`Options`]. | |||
pub fn from_options(options: Options) -> Result<Self> { | |||
let mut command = Command::new(&options.executable); | |||
let mut command = Command::new(String::from("nice")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't particularly like changing the command from the executable to 'nice', but it's the best way I know to deprioritize this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need the String::from
? The docs for Command::new don't use it, but I didn't check the version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "nice" is a "slice" and new needs a String, This is something I did to make OTT build happily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it a shot without wrapping in String::from
and didn't have any trouble compiling. I admit I don't entirely understand the AsRef<OsStr>
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I guess there was some other issue that I've since fixed. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we able to use setpriority()
or some other POSIX function instead of calling a separate program? I guess nice
will probably be available anywhere those are, though, so maybe the function is not any better. 🤔
Out of curiosity, how did you discover the thread starvation issue? Does it happen on CI? And how will we know that adjusting the nice value fixes the thread starvation? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % comments and questions. Thanks, @drewmacrae!
@@ -35,9 +35,10 @@ pub struct Subprocess { | |||
impl Subprocess { | |||
/// Starts a verilator [`Subprocess`] based on [`Options`]. | |||
pub fn from_options(options: Options) -> Result<Self> { | |||
let mut command = Command::new(&options.executable); | |||
let mut command = Command::new(String::from("nice")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need the String::from
? The docs for Command::new don't use it, but I didn't check the version.
let mut args = Vec::new(); | ||
|
||
args.push(String::from("-5")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Peeking at man nice
, doesn't -5
need to be preceded by -n
or --adjustment=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes it clearer we're increasing niceness but less concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now I'm worried that nice
might be reading plain -5
as a parameter and silently ignoring it.
% nice -5 bash -c 'echo hello'
hello
% nice -n -5 bash -c 'echo hello'
nice: cannot set niceness: Permission denied
hello
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works on my machine
nice -5 yes > /dev/null & htop
-5 should be equivalent to -n 5 IIRC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make it more verbose and readable, I just want to check that it works before I push the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
positive nicing requires lesser permissions
Good point. Positive nicing is less favorable to the process, so I think -5
is the wrong direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's misleading so I'm changing it
nice -5 nice
:5
nice -n 5 nice
:5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a different sort of hyphen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I thought it was silently ignoring -5
, but it actually interpreted it as -n 5
. Got it.
I think my persistent confusion about this is evidence it should be a verbose -n 5
:P
@@ -35,9 +35,10 @@ pub struct Subprocess { | |||
impl Subprocess { | |||
/// Starts a verilator [`Subprocess`] based on [`Options`]. | |||
pub fn from_options(options: Options) -> Result<Self> { | |||
let mut command = Command::new(&options.executable); | |||
let mut command = Command::new(String::from("nice")); | |||
let mut args = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out-of-scope nit: Looks like Command::arg()
and Command::args()
exist. Given that command
is already mutable, I think we could probably eliminate the args
vector. Not super important and adds more noise to the diff, though.
Good question. I came across it with the custom resource allocation work I've been doing. I've tested it in this branch #16436 where I'd previously been allocating 8 cpus to the FPGA tests to prevent failures. This change lets us expect the FPGA tests to run on a single cpu instead. |
What is the situation where we are getting thread starvation for the test harness? And why are we getting ourselves into that situation? Are we sure there isn't something structurally wrong with the architecture of specific tests or of the dispatch strategy? This PR may merely be missing important motivating information here (especially observations / data), but I am skeptical that this approach is best. Messing with thread priority is playing with fire, and there very likely are situations where decreasing the thread priority of the simulator's threads is deleterious to the performance of tests. At the very least, I would oppose hard-coded constructions that are merely made convenient for the current CI environment. Perhaps consider adding a backend configuration option here? |
On my workstation I have 72 cores which means I can run 17 4:cpu verilator tests, and have 4 cores available to run fpga tests which in practice require 1 core. If these are simultaneously dispatched the fpga tests fail reasonably often, but of I run only 16 verilator tests it's healthy. The fpga tests don't require or use 8 cores, so while it's sufficient to rescue the tests I dislike reserving them for it. We also see some test-harness interactions causing problems in CI which may be resulting from this kind of resource contention but also might not. (We don't have a means to reproduce that issue) |
Nope, this is just an easier more elegant fix than the alternative method of accounting for resources. |
@a-will, Are you suggesting plumbing it through from rules/opentitan_test.bzl through an argument like --verilator-niceness? |
It feels like that means you can't run 17 verilator tests ;) (i.e. they actually require more than 4 cores). For tests that are time-sensitive, should we be allocating an entire core for the testbench / test harness? The original 4 cores only represent the DUT, and if we want to banish as much flakiness as possible, we may want to avoid loading a system enough to cause significant contention. That's the most conservative approach and leads to a less efficient use of resources, so it's just a question. There's some balance to be had between false failure probability and average throughput. So I wouldn't go with hard-coding this for everyone, but the backend configuration option might give you an opportunity to play with it. |
Yup, exactly! :) |
For me, this statement isn't as much of a deterrent as you might hope. |
Huh, interesting, this finding would argue for a value of cpu:4.5 (which is an upper-bound sufficient to prevent this condition.) It's more elegant and honest than saying FPGA tests take cpu:8 and should have the same effect on resource based allocation. |
There's a higher overhead to protect the verilator harnesses in CI and thus resolve #16831 this way. |
…d overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. Signed-off-by: Drew Macrae <[email protected]>
Well #16915 is certainly simpler and does almost everything this PR does. |
…overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. Signed-off-by: Drew Macrae <[email protected]>
…overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. We also stop overriding resource based allocation by removing --local_test_jobs arg and update documentation Signed-off-by: Drew Macrae <[email protected]>
I think one of the bins of failure was a harness timing out waiting for a sim, so we really need less sims rather than nicer sims. |
…overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. We also stop overriding resource based allocation by removing --local_test_jobs arg and update documentation Signed-off-by: Drew Macrae <[email protected]>
…overhead An alternative to #16913 to prevent harnesses from getting pre-empted by a simulation load. We also stop overriding resource based allocation by removing --local_test_jobs arg and update documentation Signed-off-by: Drew Macrae <[email protected]>
Fix some timing issues by running sims with nice -5 to ensure test harnesses don't get starved of threads. This should solve some batch testing problems with a large number of threads serving verilator simulations.
I think it might help for: #16831 (Though that may just be a port collision which could require additional machinery)
It protects both the verilator and the cw310 test harnesses from heavy sim workloads which will be important to running cw310 tests that aren't exclusive.
Signed-off-by: Drew Macrae [email protected]